{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a952a54c",
   "metadata": {},
   "source": [
    "# Creating a Llama 3.1 LoRA adapter with NeMo Framework using a Synthetic Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63b3bd1c",
   "metadata": {},
   "source": [
    "This notebook showcases performing LoRA finetuning on **Llama 3.1-8B-Instruct** with a synthetically augmented version of [Law StackExchange](https://huggingface.co/datasets/ymoslem/Law-StackExchange) dataset using NeMo Framework. Law StackExchange is a dataset of legal question/answers. Each record consists of a question, its title, as well as human-provided answers.\n",
    "\n",
    "For this demonstration, we will tune the model on the task of title/subject generation, that is, given a Law StackExchange forum question, auto-generate an appropriate title for it.\n",
    "\n",
    "> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo) which has all the required dependencies. **Instructions are available in the associated tutorial README to download the model and the container.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92ba5569",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install ipywidgets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1cf3dc30",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import numpy as np\n",
    "from rouge_score import rouge_scorer, scoring"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b129833",
   "metadata": {
    "tags": []
   },
   "source": [
    "---\n",
    "## Before you begin\n",
    "Ensure you have the following -\n",
    "1. **Generate the synthetic dataset**: Follow the [PEFT Synthetic Data Generation (SDG)](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg) tutorial to obtain the synthetic dataset. Once obtained, you must follow the instructions in the associated README to mount it in the NeMo FW container."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8492edc1",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "DATA_DIR = os.path.join(\"/workspace/curated-data\")\n",
    "\n",
    "!ls {DATA_DIR}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97990ad8",
   "metadata": {},
   "source": [
    "You should see the `law-qa-{train/val/test}.jsonl` splits resulting from following the abovementioned SDG tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63b061f4",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "TRAIN_DS = os.path.join(DATA_DIR, \"law-qa-train.jsonl\")\n",
    "VAL_DS = os.path.join(DATA_DIR, \"law-qa-val.jsonl\")\n",
    "TEST_DS = os.path.join(DATA_DIR, \"law-qa-test.jsonl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1c0f9b8",
   "metadata": {
    "tags": []
   },
   "source": [
    "2. **Get the model**: Download the `Meta Llama 3.1 8B Instruct .nemo` model and mount the corresponding folder to the container."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3728f222",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!ls /workspace/llama-3_1-8b-instruct-nemo_v1.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56b7a698",
   "metadata": {
    "tags": []
   },
   "source": [
    "3. **Set the Hugging Face Access Token**: You can obtain this from your [Hugging Face account](https://huggingface.co/docs/hub/en/security-tokens). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b546cb59",
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "\n",
    "login(token=\"<YOUR_HF_ACCESS_TOKEN>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "570025c5",
   "metadata": {
    "tags": []
   },
   "source": [
    "---\n",
    "##  Step-by-step instructions\n",
    "\n",
    "This notebook is structured into four steps:\n",
    "1. Prepare the dataset\n",
    "2. Run the PEFT finetuning script\n",
    "3. Inference with NeMo Framework\n",
    "4. Check the model accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3894607a",
   "metadata": {},
   "source": [
    "### Step 1: Prepare the dataset\n",
    "\n",
    "This dataset has already undergone several filtering and processing operations, and it can be used to train the model for various different tasks - question title generation (summarization), law domain question answering, and question tag generation (multi-label classification).\n",
    "\n",
    "Take a look at a single row in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6b47e31",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# TRAIN, VAL and TEST splits all follow the same structure\n",
    "!head -n1 {TRAIN_DS}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bed493d",
   "metadata": {},
   "source": [
    "You will see several fields in the `.jsonl`, including `title`, `question`, `answer`, and other associated metadata.\n",
    "\n",
    "For this tutorial, our input will be the `answer` field, and output will be it's `title`. \n",
    "\n",
    "The following cell does two things -\n",
    "* Adds a template - a prompt instruction (which is optional), and format `{PROMPT} \\nQUESTION: {data[\"question\"]} \\nTITLE: `.\n",
    "* Saves the data splits into the same location, also appending a `_preprocessed` marker to them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "188b93b7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Add a prompt instruction.\n",
    "PROMPT='''Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more.'''\n",
    "\n",
    "# Creates a preprocessed version of the data files\n",
    "for input_file in [TRAIN_DS, VAL_DS, TEST_DS]:\n",
    "    output_file = input_file.rsplit('.', 1)[0] + '_preprocessed.jsonl'\n",
    "    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:\n",
    "        for line in infile:\n",
    "            # Parse each line as JSON\n",
    "            data = json.loads(line)\n",
    "\n",
    "            # Create a new dictionary with only the desired fields, renamed and formatted\n",
    "            new_data = {\n",
    "                \"input\": f'''{PROMPT} \\nQUESTION: {data[\"question\"]} \\nTITLE: ''',\n",
    "                \"output\": data['title']\n",
    "            }\n",
    "\n",
    "            # Write the new data as a JSON line to the output file\n",
    "            json.dump(new_data, outfile)\n",
    "            outfile.write('\\n')  # Add a newline after each JSON object\n",
    "\n",
    "    print(f\"Processed {input_file} and created {output_file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39388cc3",
   "metadata": {},
   "source": [
    "After running the above scripts, you will see  `law-qa-{train/test/val}_preprocessed.jsonl` files appear in the data directory.\n",
    "\n",
    "This is what an example will be formatted like -\n",
    "\n",
    "```json\n",
    "{\"input\": \"Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \\nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \\nTITLE: \", \n",
    " \"output\": \"What constitutes \\\"doing business in a jurisdiction?\\\"\"}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f53038ad",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# clear up any cached mem-map file\n",
    "!rm curated-data/*idx*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebd28f0d",
   "metadata": {},
   "source": [
    "\n",
    "### Step 2: Run PEFT finetuning script for LoRA\n",
    "\n",
    "NeMo framework includes a high level python script for fine-tuning  [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py) that can abstract away some of the lower level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo is essentially just running this script!\n",
    "\n",
    "For this demonstration, this training run is capped by `max_steps`, and validation is carried out every `val_check_interval` steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.\n",
    "\n",
    "> `NOTE:` In the block of code below, pass the paths to your train, test and validation data files as well as path to the .nemo model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15228de7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Set paths to the model, train, validation and test sets.\n",
    "MODEL=\"/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo\"\n",
    "\n",
    "TRAIN_DS=\"[./curated-data/law-qa-train_preprocessed.jsonl]\"\n",
    "VALID_DS=\"[./curated-data/law-qa-val_preprocessed.jsonl]\"\n",
    "TEST_DS=\"[./curated-data/law-qa-test_preprocessed.jsonl]\"\n",
    "TEST_NAMES=\"[law]\"\n",
    "\n",
    "SCHEME=\"lora\"\n",
    "TP_SIZE=1\n",
    "PP_SIZE=1\n",
    "\n",
    "OUTPUT_DIR=\"./results/Meta-llama3.1-8B-Instruct-titlegen\"\n",
    "rm -r $OUTPUT_DIR\n",
    "\n",
    "torchrun --nproc_per_node=1 \\\n",
    "/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \\\n",
    "    exp_manager.exp_dir=${OUTPUT_DIR} \\\n",
    "    exp_manager.explicit_log_dir=${OUTPUT_DIR} \\\n",
    "    trainer.devices=1 \\\n",
    "    trainer.num_nodes=1 \\\n",
    "    trainer.precision=bf16-mixed \\\n",
    "    trainer.val_check_interval=0.2 \\\n",
    "    trainer.max_steps=1000 \\\n",
    "    model.megatron_amp_O2=True \\\n",
    "    ++model.mcore_gpt=True \\\n",
    "    model.tensor_model_parallel_size=${TP_SIZE} \\\n",
    "    model.pipeline_model_parallel_size=${PP_SIZE} \\\n",
    "    model.micro_batch_size=1 \\\n",
    "    model.global_batch_size=32 \\\n",
    "    model.restore_from_path=${MODEL} \\\n",
    "    model.data.train_ds.file_names=${TRAIN_DS} \\\n",
    "    model.data.train_ds.concat_sampling_probabilities=[1.0] \\\n",
    "    model.data.validation_ds.file_names=${VALID_DS} \\\n",
    "    model.peft.peft_scheme=${SCHEME}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "268e4618",
   "metadata": {},
   "source": [
    "This will create a LoRA adapter - a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/`. We'll use this later.\n",
    "\n",
    "To further configure the run above -\n",
    "\n",
    "* **A different PEFT technique**: The `peft.peft_scheme` parameter determines the technique being used. In this case, we did LoRA, but NeMo Framework supports other techniques as well - such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, for P-tuning, simply set \n",
    "\n",
    "```bash\n",
    "model.peft.peft_scheme=\"ptuning\" # instead of \"lora\"\n",
    "```\n",
    "You can override many such configurations (such as `learning rate`, `adapter dim`, and more) while running the script. A full set of possible configurations is available in [NeMo Framework Github](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fed5465",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Step 3: Inference with NeMo Framework\n",
    "\n",
    "Running text generation within the framework is also possible with running a Python script. Note that is more for testing and validation, not a full-fledged  deployment solution like NVIDIA NIM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18a5adfc",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Check that the LORA model file exists\n",
    "!ls -l ./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "50b1dc9b",
   "metadata": {},
   "source": [
    "In the code snippet below, the following configurations are worth noting - \n",
    "\n",
    "1. `model.restore_from_path` to the path for the Meta-Llama-3.1-8B-Instruct.nemo file.\n",
    "2. `model.peft.restore_from_path` to the path for the PEFT checkpoint that was created in the fine-tuning run in the last step.\n",
    "3. `model.test_ds.file_names` to the path of the preprocessed test file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0bd9d602",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create a smaller test subset for a quick eval demonstration.\n",
    "\n",
    "!head -n 128 ./curated-data/law-qa-test_preprocessed.jsonl > ./curated-data/law-qa-test_preprocessed-n128.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d029b0b5",
   "metadata": {},
   "source": [
    "If you have made any changes in model or experiment paths, please ensure they are configured correctly below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "630c0305",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "MODEL=\"/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo\"\n",
    "\n",
    "TEST_DS=\"[./curated-data/law-qa-test_preprocessed-n128.jsonl]\" # Smaller test split\n",
    "# TEST_DS=\"[./curated-data/law-qa-test_preprocessed.jsonl]\" # Full test set\n",
    "TEST_NAMES=\"[law]\"\n",
    "\n",
    "TP_SIZE=1\n",
    "PP_SIZE=1\n",
    "\n",
    "# This is where your LoRA checkpoint was saved\n",
    "PATH_TO_TRAINED_MODEL=\"/workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo\"\n",
    "\n",
    "# The generation run will save the generated outputs over the test dataset in a file prefixed like so\n",
    "OUTPUT_PREFIX=\"law_titlegen_lora\"\n",
    "\n",
    "python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \\\n",
    "    model.restore_from_path=${MODEL} \\\n",
    "    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \\\n",
    "    trainer.devices=1 \\\n",
    "    trainer.num_nodes=1 \\\n",
    "    model.data.test_ds.file_names=${TEST_DS} \\\n",
    "    model.data.test_ds.names=${TEST_NAMES} \\\n",
    "    model.data.test_ds.global_batch_size=32 \\\n",
    "    model.data.test_ds.micro_batch_size=1 \\\n",
    "    model.data.test_ds.tokens_to_generate=25 \\\n",
    "    model.tensor_model_parallel_size=${TP_SIZE} \\\n",
    "    model.pipeline_model_parallel_size=${PP_SIZE} \\\n",
    "    inference.greedy=True  \\\n",
    "    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \\\n",
    "    model.data.test_ds.write_predictions_to_file=True \\\n",
    "    model.data.test_ds.truncation_field=\"null\" \\\n",
    "    model.data.test_ds.add_bos=False \\\n",
    "    model.data.test_ds.add_eos=True \\\n",
    "    model.data.test_ds.add_sep=False \\\n",
    "    model.data.test_ds.label_key=\"output\" \\\n",
    "    model.data.test_ds.prompt_template=\"\\{input\\}\\ \\{output\\}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "513cd732",
   "metadata": {},
   "source": [
    "### Step 4: Check the model accuracy\n",
    "\n",
    "Now that the results are in, let's read the results and calculate the accuracy on the question title generation task.\n",
    "Let's take a look at one of the predictions in the generated output file. The `pred` key indicates what was generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04cb4ae7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Take a look at predictions\n",
    "!head -n1  law_titlegen_lora_test_law_inputs_preds_labels.jsonl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e88f3c9",
   "metadata": {
    "tags": []
   },
   "source": [
    "For evaluating this task, we will use [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)).  It measures overlap of ngrams, and a higher score is better. While it's not perfect and it misses capturing the semantics of the prediction, it is a popular metric in academia and industry for evaluating such systems. \n",
    "\n",
    "The following method uses the `rouge_score` library to implement scoring. It will report `ROUGE_{1/2/L/Lsum}` metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c4aa9631",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def compute_rouge(input_file: str) -> dict:\n",
    "    ROUGE_KEYS = [\"rouge1\", \"rouge2\", \"rougeL\", \"rougeLsum\"]\n",
    "    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)\n",
    "    aggregator = scoring.BootstrapAggregator()\n",
    "    lines = [json.loads(line) for line in open(input_file)]\n",
    "    num_response_words = []\n",
    "    num_ref_words = []\n",
    "    for idx, line in enumerate(lines):\n",
    "        prompt = line['input']\n",
    "        response = line['pred']\n",
    "        answer = line['label']\n",
    "        scores = scorer.score(response, answer)\n",
    "        aggregator.add_scores(scores)\n",
    "        num_response_words.append(len(response.split()))\n",
    "        num_ref_words.append(len(answer.split()))\n",
    "\n",
    "    result = aggregator.aggregate()\n",
    "    rouge_scores = {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}\n",
    "    print(rouge_scores)\n",
    "    print(f\"Average and stddev of response length: {np.mean(num_response_words):.2f}, {np.std(num_response_words):.2f}\")\n",
    "    print(f\"Average and stddev of ref length: {np.mean(num_ref_words):.2f}, {np.std(num_ref_words):.2f}\")\n",
    "\n",
    "    return rouge_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "41c661d5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "compute_rouge(\"./law_titlegen_lora_test_law_inputs_preds_labels.jsonl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f48667b1",
   "metadata": {},
   "source": [
    "For the Llama-3.1-8B-Instruct model, you should see accuracy comparable to the below:\n",
    "```\n",
    "{'rouge1': 39.2082, 'rouge2': 18.8573, 'rougeL': 35.4098, 'rougeLsum': 35.3906}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
